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Abstract 

Large-scale applications are ever-increasingly geo-distributed. Maintaining the highest possible data locality is crucial 
to ensure high performance of such applications. Dynamic replication addresses this problem by dynamically creating 
replicas of frequently accessed data close to the clients. This data is often stored in decentralized storage systems such 
as Dynamo or Voldemort, which offer support for mutable data. However, existing approaches to dynamic replication 
for such mutable data remain centralized, thus incompatible with these systems. In this paper we introduce a write- 
enabled dynamic replication scheme that leverages the decentralized architecture of such storage systems. We propose 
an algorithm enabling clients to locate tentatively the closest data replica without prior request to any metadata node. 
Large-scale experiments on various workloads show a read latency decrease of up to 42% compared to other state-of- 
the-art, caching-based solutions. 
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1. Introduction 

Large-scale applications such as social networks are 
being increasingly deployed over multiple, geograph¬ 
ically distributed datacenters (or sites). Such geo¬ 
distribution provides fast data access for end-users 
worldwide while improving fault-tolerance, disaster- 
recovery and minimizing bandwidth costs. Today’s 
cloud computing services JT| [2) allow a wider range of 
applications to benefit from these advantages as well. 
However, designing geo-distributed applications is dif¬ 
ficult due to the high and often unpredictable latency 
between sites 0. 

Such geo-distributed applications span a large range 
of specific use-cases. For instance, cientihc applications 
such as the MonALISA ffl monitoring system for the 
CERN LHC Alice experiment 0. This application col¬ 
lects and aggregates monitoring data from 300+ sites 
distributed across the world, that must be delivered to 
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scientists worldwide in real-time. The users of com¬ 
mercial applications, such as Facebook, ever-increasing 
amounts of data that needs to be accessible worldwide. 
Ensuring the lowest possible access time for users is 
crucial for the user experience. 

A key factor impacting the performance of such ap¬ 
plications is data locality, i.e. the location of the data 
relatively to the application. Accessing remote data 
is orders of magnitude slower than using local data. 
Although such remote accesses may be acceptable for 
rarely-accessed data ( cold data), they hinder applica¬ 
tion performance for frequently-used data ( hot data). 
For instance, in a social network application, popular 
profiles should be replicated at all sites whereas oth¬ 
ers can remain located at fewer locations. Finding the 
right balance between replication and storage is criti¬ 
cal: replicating too many profiles wastes costly memory 
and bandwidth, while failing to replicate popular ones 
results in degraded application performance. 

Dynamic replication 0 proposes to solve this issue 
by dynamically replicating hot data as close as possible 
to the applications that access it. This technique is lever¬ 
aged in Content Delivery Networks ( CDN ) to cache im¬ 
mutable data close to the final user ||7][S). Similarily, it 





is used in storage systems such as GFS 0 orHDFS GO) 
to replicate mutable data, by relying on the centralized 
metadata management of these systems DUE). Yet, 
such an approach contradicts the design principles of 
decentralized storage systems such as Dynamo m or 
Voldemort G3. which aim to enable clients to locate 
data without exchanges with dedicated metadata nodes. 

Furthermore, handling mutable objects in the con¬ 
text is difficult. Indeed, the dynamic replicas have to 
be kept synchronized with the origin data, without im¬ 
pacting the consistency guarantees of the underlying 
system. To the best of our knowledge, no decentral¬ 
ized, write-enabled dynamic replica location and man¬ 
agement method exists in the literature today. Reach¬ 
ing this goal while providing predictable overhead and 
guaranteed accuracy is not trivial. In this paper we 
demonstrate that this objective can be reached by com¬ 
bining the architecture of this systems with deceptively 
simple algorithms from the literature. We make these 
contributions, which substantially revise and extend the 
early principles we previously introduced in 03: 

• After briefly introducing the applications we target 
(Section [2]), the related work (Section [3} and the 
storage systems we target (Section [4]), we char¬ 
acterize the challenges of decentralizing write- 
enabled dynamic data replication (Section[5]». 

• We address these challenges with a decentral¬ 
ized data popularity measurement scheme (Sec¬ 
tion |6|, which leverages existing state-of-the-art 
storage system architecture to identify hot data 
cluster-wide dynamically. 

• Based on these popularity measurements, we de¬ 
scribe a dynamic data replication algorithm 

which dynamically creates and manages replicas 
of hot data as close as possible to the applications 
(Section[7]i. 

• We enable clients to locate the closest of such 
data replicas using an approximate object loca¬ 
tion method (Section[8]>, which minimizes storage 
latency by avoiding communication with any ded¬ 
icated metadata node. 

• We develop a prototype implementation leverag¬ 
ing the above contributions, integrated with the 
Voldemort distributed key-value store (Section [9]), 
and prove the effectiveness of our approach with 
a large-scale experimental study on the Amazon 
Cloud (SectionfTO}. We observe a read latency de¬ 
crease of up to 42% compared to other state-of-the- 
art, caching-based algorithms. 


We discuss the effectiveness and applicability of our 
contribution (Section [TT|, and conclude on future work 
that further enhances our proposal (Section[l2)l. 

2. Large-scale, data-intensive applications 

In this paper we target large-scale applications serv¬ 
ing large amounts of data to users around the world, 
while seeking to enable low-latency access for these 
users to potentially-mutable data. Examples of such ap¬ 
plications span multiple use-cases, among which: 

Scientific system monitoring. Monitoring a geo- 
distributed cluster requires collecting potentially 
large number of metrics from computers around 
the world. This is for example the case for 
MonALISA 0, monitoring thousands of servers 
distributed in more than 300 sites around the 
world. The collected data is aggregated live, and 
is used to provide live insights about the system 
performance and availability around the world. 
To deliver the real-time promise of MonALISA, 
ensuring that the monitoring data that is needed by 
scientists around the world is located as close as 
possible to them is crucial. 

Social networks. Business applications such as social 
networks ingest overwhelming amounts of data. 
Facebook, for example, is expected to react 2 bil¬ 
lion active profiles in the next few weeks EH. 
Every single day, it processes 350 million photo 
uploads 1171 or 6 billion posts Ifl8l . The strong 
500:1 read to write ratio ED calls for large-scale 
caching. However, some of this data is muta¬ 
ble by nature. This is for example the case of 
user profiles, which are hard to cache while keep¬ 
ing this cache synchronized with the source data, 
calling for heavyweight, custom cache invalidation 
pipelines ll20l . 

The real-time promise of such applications requires 
to keep up-to-date data as close as possible to the end- 
user. The global scale of these applications makes this 
difficult while keeping costly bandwidth and storage us¬ 
age low. We will detail these challenges in Section [5] 
These challenges are independent of the type of plat¬ 
forms such as compute grids for MonALISA or clouds 
for Facebook. 

3. Related work 

In the literature, dynamic replication stands as a topic 
of interest for all applications requiring access to shared 
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data from many geo-distributed locations. Most of these 

contributions can be classified in two categories: 

Immutable data, decentralized management. A 

range of applications require to provide their users 
with fast and timely access to static resources 
such as images or videos. This is the case of most 
global internet applications, in which Content 
Distribution Networks ( CDNs ) help provide a 
good user experience by creating replicas of static, 
immutable data as close as possible to the clients 
that access it. 

Yet, CDNs are targeted at serving content directly 
to the final user. In this paper, we focus on allow¬ 
ing a geo-distributed application to access a geo- 
distributed data source with the lowest possible la¬ 
tency. 

Kingsy Grace et al. J7] provide an extensive sur¬ 
vey of replica placement and selection algorithms 
available in the literature. Among these, Chen et 
al. (8) propose a dissemination-tree based replica¬ 
tion algorithm leveraging a peer-to-peer location 
service. Dong et al. lITTI transform the multiple- 
location problem into several classical mathemat¬ 
ical problems with different parameter settings, 
for which efficient approximation algorithms ex¬ 
ist. However, they don’t consider the impact of 
replication granularity on performance and scala¬ 
bility. Wei et al. l22l address this issue by devel¬ 
oping a model to express availability as a function 
of replica number. This approach, however, only 
works within a single site, as it assumes uniform 
bandwidth and latency, which is not the case with 
the geo-distributed workloads that we target. In¬ 
spired by the P2P systems, l23l proposes an adap¬ 
tive decentralized file replication algorithm that 
achieves high query efficiency and high replica uti¬ 
lization at a significantly low cost. In Il24l . Mac- 
Cormick et al. enable storage systems to achieve 
balanced utilization of storage and network re¬ 
sources in the presence of failures, and skewed 
distributions of data size and popularity. Madi et 
al. [25J consider a wider usage of parameters in 
the context of data grids such as read cost or file 
transfer time. 

Mutable data, centralized management. However, a 
range of applications rely on mutable data. This 
is for example the case in MonALISA monitor¬ 
ing of the CERN LHC experiment (4) 0. In a 
web applications, this is observed with social net¬ 
work profile pages, status pages, comments on a 


news thread, or more generally services display¬ 
ing publicly user-generated content. In all these 
applications, we also observe that the data objects 
are mutable (changing aggregates from new mon¬ 
itoring events, user updating their profile, post¬ 
ing new statues or comments). Available geo¬ 
replications solutions available typically either re¬ 
quire the application to explicitly clear modified 
objects from distant caches, or leverage a central¬ 
ized replication manager that contradicts the de¬ 
centralized design of most state-of-the-art, geo- 
distributed data stores. Dynamic replication en¬ 
ables the geo-distributed storage system to repli¬ 
cate in near real-time the most requested objects 
as close as possible to the application instances ac¬ 
cessing them. 

Efficiently creating and placing replicas of hot data 
is not enough. Indeed, one needs to ensure as well 
that those replicas are kept in synchronization with 
the original data. This is usually the case of appli¬ 
cations relying on a globally-distributed file sys¬ 
tem. Overall, the proposed solutions in the litera¬ 
ture leverage the centralized metadata management 
of certain storage systems such as HDFS IfTOll or 
GFS 0 to allow the clients to locate the closest 
available replica of the data they want to access. In 
that context, Ananthanarayanan et al. na propose 
a popularity-based dynamic replication strategy for 
HDFS aimed at improving the performance of geo- 
distributed Map-Reduce clusters. Jayalakshmi et 
al. EH models a system designed to direct clients 
to the most optimal replica available. 

Our proposal enables writes to any given object 
in a decentralized, large-scale storage system to 
be transparently forwarded to the existing dynamic 
replicas of that object, without requiring explicit 
cache eviction requests from the application. In 
contrast, replication strategies adopted in CDNs 
such as Dynamic Page Caching l26l have a sub¬ 
stantially different target ; they focus on offering 
fine-grained caching based on configured user re¬ 
quest characteristics (cookies, request origin, ...), 
while still accessing the origin data replica for dy¬ 
namic, mutable objects. 

Replica selection algorithm Targeted work on replica 
selection prove that adopting a relevant data loca¬ 
tion algorithm can lead to significant performance 
improvements. Mansouri et al. na propose a 
distributed replication algorithm named Dynamic 
Hierarchical Replication Algorithm (DHR), which 
selects replica location based on multiple criteria 
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such as data transfer time and request-waiting- 
time. Kumar et al. Il28ll address the problem of 
minimizing average query span, i.e. the average 
number of machines that are involved in the pro¬ 
cessing of a query through co-location of related 
data items. C3 l29l goes even further by dynami¬ 
cally adapting replica selection based on real-time 
metrics in an adaptive replica selection mecha¬ 
nism that reduces request tail latency in presence 
of service-time fluctuations in the storage system. 

However, none of these contributions considers the 
case of mutable data stored in decentralized data stores, 
such as Cassandra lf30l or Voldemort lfl4fl . Facebook, 
for example, circumvents the issue by directing all write 
requests to a single data center and using a dedicated 
protocol to keep the cache consistent across other re¬ 
gions flT9l . In this paper, we fill this gap by en¬ 
abling efficient data replication of mutable data in geo- 
distributed, decentralized data stores. 

4. Background: The systems we target 

Let us first briefly describe the key architectural prin¬ 
ciples that drive the design of a number of decentralized 
systems. Dynamo lTT3Tl has inspired the design of many 
of such systems, such as Voldemort |[T4l . Cassandra |f30l 
or Riak (3TJ. In this paper we target this family of sys¬ 
tems, which are widely used in the industry today. 

Data model. Dynamo is a key-value store , otherwise 
called distributed associative array. A key-value 
store keeps a collection of values , or data objects. 
Each object is stored and retrieved using a key that 
uniquely identifies it. 

DHT-based data distribution. Objects are distributed 
across the cluster using consistent hashing l32l 
based on a distributed hash table ( DHT ), as in 
Chord (33). Given a hash function h(jc), the out¬ 
put range [h mj „, h max ] of the function is treated as 
a circular space (h min sticking around to h max ), or 
ring. Each node is assigned a different random ob¬ 
ject within this range, which represents its position 
on the ring. For any given key k, a position on 
the ring is determined by the result of h (k). The 
primary node holding the primary static replica of 
the object is the first one encountered while walk¬ 
ing the ring passed this position. To ensure fault- 
tolerance, additional static replicas are created at 
the time the object is stored. These are placed on 
the next r nodes following the primary node on the 


ring, r being the configured replication factor of 
the system (usually 2). 

P2P cluster state dissemination. The position of each 
node on the ring is advertised in the cluster us¬ 
ing a family of peer-to-peer (P2P) protocols: Gos¬ 
sip lf34l . Each node periodically disseminates 
its status information to a number of randomly- 
selected nodes and relays status information re¬ 
ceived from other nodes. This method is also used 
to detect and advertise node failures across the 
cluster l35l . 

Client request routing. By placing objects determin¬ 
istically in the cluster. Dynamo obviates the need 
for dedicated metadata servers. Clients are able 
to perform single-hop reads, i.e. address their re¬ 
quests directly to the nodes holding the data. This 
enables a minimal storage operation latency and 
higher throughput. Should a client address the re¬ 
quest to a node not holding the requested data, this 
node will forward the request directly to the cor¬ 
rect one. This correct node is determined using the 
ring state information disseminated throughout the 
cluster. 

Deterministically placing data objects and dissem¬ 
inating ring status in the cluster enables each node 
to route incoming client requests directly to a node 
holding the data. Operation latency is further re¬ 
duced by opening cluster state information to the 
clients so they can address their requests straight 
to the correct node, without any metadata server 
involved. 

We found our strategy on these design principles, 
which allow us to guarantee the correctness of the pro¬ 
posal we describe in this paper. We choose not to mod¬ 
ify the original static replication mechanism, offering 
the same data durability as the underlying system. We 
also do not change the server-side client routing mecha¬ 
nism, consequently guaranteeing that static replicas are 
always reachable. This allows us to focus on develop¬ 
ing an efficient heuristic that maximizes the accuracy of 
popular object identification, optimizes the creation and 
placement of dynamic replicas of such popular objects, 
and helps clients efficiently locating the closest of these 
replicas. 

5. Our proposal in brief: outline and challenges 

In this paper, we demonstrate that it is possible to in¬ 
tegrate dynamic replication with the existing architec- 
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ture of these storage systems, which enables us to lever¬ 
age their existing, built-in algorithms to efficiently han¬ 
dle read and writes in geo-distributed environments. 

Such dynamic replication seeks to place new copies 
of the hot data in sites as close as possible to the ap¬ 
plication clients that access it. To that end, we permit 
the clients to vote for dynamic object replicas to be cre¬ 
ated at a specific site. These votes are collected at each 
node and disseminated across the cluster so that the ob¬ 
jects which received the most votes (or popular objects) 
are identified by the storage system. Dynamic replicas 
of such popular objects are created at sites where they 
are popular, and deleted when their popularity drops. 
When trying to access an object, clients tentatively de¬ 
termine the location of its closest replica (either static 
or dynamic) and address requests directly to the node 
holding it. Such approach however raises a number of 
challenges. 

We acknowledge that using client votes has been 
proposed before in the context of replicated relational 
databases, specifically to ensure data consistency ll36l . 
Transposing this idea to decentralized storage systems 
poses a number of significant challenges that we address 
in this paper. Specifically, collecting client votes effi¬ 
ciently without using a centralized process requires us 
to propose a novel, fully-decentralized, loosely-coupled 
vote collection algorithm. While existing dynamic 
replication techniques leverage a centralized repository 
to direct client requests to the nearest available replica, 
we propose a technique allowing clients to tentatively 
locate the closest available replica without any prior re¬ 
quest to any of such repositories. 

5.1. Collecting and counting votes 

The goal of dynamic replication is to improve storage 
operation latency for the clients. Therefore, we need to 
design an efficient way to let the clients cast their votes 
for objects. 

Collecting votes also raises a major challenge. While 
determining the most voted-for objects at each node is 
straightforward and can be done efficiently, inferring 
from this the most popular objects cluster-wide is not 
an easy task. This problem is named distributed top-k 
monitoring. Sadly, most implementations in the litera¬ 
ture EllMlESliO 1 ] are centralized. 

We address both these issues by mixing an approxi¬ 
mate frequency estimation algorithm with the existing, 
lightweight Gossip protocol provided by storage system 
(Section[6]i. 


5.2. Tunable replication 

Determining when to create a new dynamic replica 
or delete an existing one based on the previous informa¬ 
tion is also challenging in a distributed setup. Indeed, it 
is necessary to bound the number of replicas to be cre¬ 
ated at each site without any node being responsible for 
coordinating the replicated items. Consequently, nodes 
must synchronize with each other before creating a new 
replica of any object. One could consider using a con¬ 
sensus protocol such as Paxos Bdl . However, we argue 
it would be an overkill for such a simple task as Paxos 
is by no means a light protocol ga¬ 
it turns out that the technique we use to solve the vote 
collection and counting challenge also provides all the 
information we need to solve this issue (Section[7]). 

5.3. Dynamic replica location 

We need to enable clients to locate dynamic replicas 
as they do for static replicas. Obviously, such replicas 
should also be placed at a predictable, deterministically 
chosen node. We achieve this using the DHT-based data 
distribution of the storage system (Section |7.1| >. To ac¬ 
cess the closest available replica, a client also needs to 
know whether or not a dynamic replica exists at a given 
site. Systematically probing nearby sites for available 
replicas would contradict the single-hop read feature of 
Dynamo. Also, this would significantly increase stor¬ 
age operation latency, consequently missing the point 
of dynamic replication which is precisely to reduce this 
latency. 

We demonstrate that this issue can be solved us¬ 
ing probabilistic algorithms (Section [8]). Our proposal 
builds on the solutions we adopt for the two afore- 
described challenges. 

6. Identifying hot objects with client votes 

In this section, we describe how to identify the most 
popular objects at each site. This is achieved in three 
steps: 

1. We describe an efficient way to allow the client to 
vote for an object to be replicated dynamically 
at a specific location (Section [6T| ). 

2. We maintain a local count of these votes at each 
node to identify the most voted-for objects (Sec¬ 
tion |6T2]). 

3. We disseminate and merge these votes through¬ 
out the cluster to provide each node with a vision 
of the most popular objects for each site, cluster¬ 
wide (Section [fO|>. 
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The identification method we describe in this section 
addresses the vote collection and counting challenge 
above. 


6.1. Client vote casting 

Clients vote for objects to be replicated dynamically 
at sites close to them. We name these sites preferred 
sites. This proximity can for instance express network 
latency, but also metrics such as bandwidth cost or avail¬ 
able computational power may also be considered. To 
this purpose, each client maintains a list of such pre¬ 
ferred sites, ordered by preference. 

We argue that existing read queries to the storage sys¬ 
tem provide an ideal base for vote casting, as clients 
intuitively vote only for objects they need to read. In 
contrast, objects being written-to are not good candi¬ 
dates for dynamic replication because of the synchro¬ 
nization needed to keep dynamic replicas in sync with 


static replicas; we discuss write handling in Section 7.3 


For every storage operation on an object, the client in¬ 
dicates in the request message its preferred sites for this 
object, i.e. the sites where the client would have pre¬ 
ferred a dynamic replica of the object to exist. Let us 
assume a client wants to read the object associated with 
the key key. The client sends the request to the closest 
node n holding a replica of that object. Say this node 
belongs to a site s. We detail the location of this clos¬ 
est node in Section [8] The client piggybacks the re¬ 
quest message with the list of the subset of sites having 
a higher preference than s in its list of preferred sites. 
Such request is interpreted by n as a vote for this object 
to be replicated on these sites. 


6.2. Node-local vote collection and hot object identifi¬ 
cation 

At each node, we want to know for each site the most 
voted-for objects. These are considered as candidates 
for dynamic replication. Each time a node receives a 
read request for an object identified by key, it records 
the vote for this object to be replicated on all sites indi¬ 
cated as preferred by the client. Let us first assume that 
we keep one counter per key and per site, which is incre¬ 
mented by 1 for each vote. We name site counters the 
set of key counters for a single site, and vote summary 
the set of site counters for all sites. In addition, if the 
object replica identified by key is a dynamic replica, we 
consider that the client implicitly votes for this replica to 
be maintained. As such, we also record the vote for key 
on the local site of the node receiving the request, i.e. 
the site the node belongs to. Algorithm [I] details these 
actions by a node receiving a read request from a client. 


However, the goal of this scheme is to adapt to fluc¬ 
tuating object popularity by replicating dynamically the 
objects having the highest popularity over a recent pe¬ 
riod of time. Consequently, we use successive voting 
rounds. We extract at the end of each round the most 
voted-for objects for each site, and create a new, empty 
vote summary for the subsequent round. The length of 
a round is a cluster setting: we discuss its value in Sec¬ 
tion 11.2 We synchronize these rounds across the clus¬ 


ter by using the local clock of each node. 

Keeping an exact vote summary for any given round 
is memory-intensive. It has a memory complexity of 
0(M * S), M being the number of objects voted-for in 
this round and S being the number of sites in the cluster. 
Such complexity is not tolerable as billions of objects 
may exist and be queried in the cluster. Luckily, we do 
not need to keep the vote count for all objects: we are 
only interested in knowing which are the most voted-for 
objects for each site. For each site, finding the k most 
frequent occurrences of a key in a stream of data (client 
votes) is a problem known as top-A counting. Multiple 
approximate, memory-efficient solutions to this prob¬ 
lem exist in the literature. In the context of our system, 
such approximate approaches are tolerable as it is not 
critical to collect the exact vote count for each object 
as long as the estimation of their vote count is precise 
enough and the set of objects identified as popular ac¬ 
curately captures the votes expressed by the clients. As 
such, we use as vote summaries a set of approximate 
top-A estimators, k being a configuration setting whose 
value is discussed in Section 111.11 We choose to use 
the Space-Saving algorithm |[43l as top-A estimator. It 
guarantees strict error bounds for approximate counts 
of votes, and only uses limited, configurable memory 
space. Its memory complexity is Oik). For any given 
site, the output of Space-Saving is the approximate list 
of the A most voted-for keys, along with an estimation 
of the number of votes for each. 

Any given node simultaneously maintains |S | active 
structures, one for each node in the cluster. Each time 
this node receives a request for an object v, for each pre¬ 
ferred site indicated in the request, the key of v is added 
to the corresponding active structure. Consequently, at 
any time, a node is able to know which are the most fre¬ 
quent replication preferences indicated by a client for 
any site over the previous time window. 


6.3. Cluster-wide vote summary dissemination 

In this section we explain how to obtain the most 
voted-for objects across all nodes, starting from local 
vote summaries built from user votes (1). We periodi¬ 
cally share the local vote summaries of each node with 
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Algorithm 1 Node-local object vote counting 

Input: key. key of an object to read, prefs : list of preferred sites provided by the client, 
procedure CountClient Votes ( key, prefs) 

> Interpret reading a dynamic replica as an implicit vote 

let local be the local site of the current node 

let replica be the local replica of the object with key key 

if the replica is a dynamic replica then 
add local to prefs 

end if 

> Add client votes to the local vote summary 

for each preferred site site in pref do 

let vs[site] be the site counter structure for site 
count one vote for key in vs[site] 

end for 

end procedure 


its peers (2). Merging these peer vote summaries (3) 
gives each node a view of the most popular items across 
the cluster. Figure [I] illustrates this process. 

We organize the process at any given node n in suc¬ 
cessive phases. During a voting round r of duration t, 
the local vote summary capturing client votes is named 
active summary. When a round ends, the summary tran¬ 
sitions to a merging state: the node sends this sum¬ 
mary to its peers, i.e. every other node in the cluster. 
Rounds being synchronized across the cluster, the node 
also receives summaries from its peers for the same vot¬ 
ing round, which are merged with the local summary; 
merging this local summary with another one received 
from a peer n' gives a summary of the votes received 
by both n and n'. When vote summaries for every peer 
have been received, the summary is complete, at which 
point all votes received by all nodes in the cluster for the 
round r are summarized. After a period 2 * t since the 
round started, this cluster-wide summary transitions to a 
seri’ing state which we detail in Section [7] We illustrate 
these successive vote summary phases in Figure [2] 

In presence of faults, a summary can reach the serv¬ 
ing state without having received all peer vote sum¬ 
maries in time. This may occur in case of delayed or 
lost packets. We qualify such summary as incomplete. 

Such an approach is consistent with the class of sys¬ 
tems we target: Dynamo provides an efficient algorithm 
for disseminating information across the cluster: Gos¬ 
sip. We use it to share a vote summary with every 
other node when it reaches the merging state. This ap¬ 


proach is also compatible with our design choices: the 
Space-Saving structure we use is proven to be merge- 
able in (44], with a commutative merge operation. 

Formally, we name MergeCounters(cz, b) the func¬ 
tion outlined in (44l that merges two Space-Saving 
structures a and b. MergeSummaries(v, v') is the func¬ 
tion merging two vote summaries s and s' . These sum¬ 
maries contain site counters, respectively vi,...,vs and 
vj,..., v'.. S is the total number of sites in the cluster. 
This function returns a merged summary v" containing 
S site counters v",..., v' s ', such that: 


Va £ [1,5], vc" a = MergeCounters(vc 0 , VC’a) (1) 


Considering that MergeCounters is commutative, it is 
trivial that MergeSummaries has the same property. 


Let us assume a reliable network at this point, with 
all peer summaries being received before the local sum¬ 
mary reaches a serving state. Because each node sends 
to all its peers the same local vote summary, and because 
the MergeSummaries function is commutative, the re¬ 
sulting complete summary after all peer summaries are 
merged is identical at each node. When all nodes reach 
the complete summary state, they share the same view 
of the most voted-for objects for each site. We use it 
to perform dynamic object replication in Section]?] We 
discuss the memory complexity of the popular object 


identification process in Section 11.4 
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Figure 1: Cluster-wide popular object identification overview. S is the total number of sites in the cluster and N the total number of nodes. 
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Figure 2: Timeline of vote summary states for a voting round length t. 


7. Lifecycle of a dynamic replica 

We detail in this section how to create and delete dy¬ 
namic replicas using the cluster-wide vote summaries 
while handling writes to dynamically-replicated objects. 

1. We first explain when to create a new dynamic 
replica of an object identified as popular on a site 
(Section |7.1| >. 

2. We describe the process of removing those data 
replicas when their popularity popularity drops 
(Section |7.2| i. 

3. We finally explain the process of forwarding writes 
to these dynamic replicas while retaining under¬ 
lying storage system consistency and guarantees 
(Section |7.3| l. 

7.1. When and where to create a dynamic replica? 

With access to a shared vote summary, deciding when 
to create a replica is straightforward. Nodes in the clus¬ 
ter create remote dynamic replicas of the popular ob¬ 
jects they are primary node for. As soon as a vote sum¬ 
mary reaches the complete state, thus summarizing the 
votes of all clients across the cluster, the top-/: most pop¬ 
ular objects are replicated to sites at which they are pop¬ 
ular. To replicate an object obj identified by a key key 


on a remote site s, a node n first informs all nodes hold¬ 
ing static replicas of obj , which store this in their local 
state. Upon acknowledgement from these static replica 
nodes, the primary node updates its local state as well, 
and copies obj to a node at site ,y. The node on which 
this replica is placed is selected deterministically. In the 
case of Dynamo, we can use the existing DHT to place 
objects at a site in such a deterministic fashion. Starting 
on the cluster ring from the position h (key), we walk the 
ring until we find a node at s, on which the replica is 
placed. Such a method is used today by Cassandra El 
for rack-aware data placement. Thus, assuming that a 
client knows a dynamic replica of an object exists at a 
site, it can easily infer which node holds this replica and 
address its request directly to it. 


Fault-tolerance: No replicas are created based on 
incomplete vote summaries. In the presence of failures, 
this may result in dynamic replicas of yet popular ob¬ 
jects not being created at the initiative of its primary 
node. We handle this case with the replica read process 
we outline in Section [5] 
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7.2. When to delete a dynamic replica? 

Each node is responsible for the deletion of any dy¬ 
namic replicas it holds. A dynamic replica at a site s can 
be deleted if it is not among the top -k items for this site ,v 
in the serving summary for the current time period. We 
also want to avoid replica bounces, i.e. object replicas 
being repeatedly created and deleted at the same site. 
This may happen for objects whose popularity ranking 
is around the top-k threshold, and fluctuates above and 
under this threshold. We define a grace period g, which 
represents the minimum number of consecutive serving 
summaries a previously-popular object must be absent 
from before its dynamic replica is deleted. 

The deletion process is simple. To delete a locally- 
held dynamic replica of an object obj , a node flags it as 
inactive: subsequent client requests for that replica are 
as if it does not exist, which we discuss in Section [8] 
The node then informs the static replica nodes of obj 
of the deletion of this replica, which they remove from 
their local state. 

Fault-tolerance: No replicas are deleted based on 
incomplete vote summaries. Replicas of objects whose 
popularity dropped may not be deleted as they should in 
the presence of failures. Yet, this principle ensures that 
no replicas of still popular objects are ever deleted, and 
remain accessible. 

7.3. Handling writes to dynamically replicated objects 

Our dynamic replication scheme enables the stor¬ 
age system to handle writes to dynamically-replicated 
objects. The afore-detailed dynamic replica creation 
makes this process straightforward. Clients address 
write requests for any object to one of the static replica 
nodes of this object. Based on their local state, these 
nodes determine all existing dynamic replicas of that 
object and propagate the write to all other replicas, static 
and dynamic, using the write protocol of the storage 
system. 

Correctness: because our replication system does 
not modify the write propagation algorithm of the un¬ 
derlying system, writes are propagated to dynamic repli¬ 
cas the same way they are propagated to static repli¬ 
cas. In Voldemort, writes are eventually-consistent by 
default, and can optionally be made strongly consis¬ 
tent. Our proposal shows the same consistency char¬ 
acteristics. As such, we only need to ensure that the 
creation and deletion of dynamic does not violate these 
consistency characteristics. We ensure this by persist¬ 
ing the new dynamic replica list for each object on its 


origin nodes before creating and after deleting any dy¬ 
namic replica. This guarantees that writes to dynami¬ 
cally replicated objects are always forwarded to all of 
its dynamic replicas. 

We prove that this write protocol is correct, i.e. does 
not cause dynamic replicas to be out of sync with static 
replicas, even in the case of system failures. A dy¬ 
namic replica is created only after successful acknowl¬ 
edgement from all other static replica nodes. These 
nodes only are informed of the replica deletion after it is 
flagged as inactive. This ensures that writes to an object 
are always propagated to the dynamic replicas of that 
object, even in the presence of faults. 

8. Accessing the closest replica 

In this section, we explain how to let clients access a 
close replica of objects they want to read, either static 
or dynamic, without any communication with any ded¬ 
icated metadata node. In a nutshell, we let the user re¬ 
quest information about dynamic replicas created on its 
preferred sites at any given time, and later use this data 
to infer the location of the closest replica of any object 
and access it directly. 

8.1. Locating the closest replica: dynamic replica sum¬ 
maries 

We assume a client can know at any time the list of 
all active dynamic replicas in its preferred sites. This 
assumption greatly eases locating the closest replica: 
it is the static or dynamic one located on the site with 
its highest preference, or the closest static replica if no 
replica exists at any preferred site. We name this list 
of dynamic replicas for a site s dynamic replica sum¬ 
mary of s, and detail the closest replica location in Al¬ 
gorithm]^ The client addresses its request to the node 
holding this replica on the site indicated by this algo¬ 
rithm using its knowledge of the cluster DHT. The node 
receiving this request returns the dynamic replica, if 
available. If it is not available locally, it forwards the 
request to the closest static replica node, which is guar¬ 
anteed to hold the object. 

We near these assumptions by enabling any client to 
request from any node such dynamic replica summary 
at its preferred sites at the time of the request. Any node 
is able to answer this request based on the cluster-wide 
vote summary dissemination process. Using its knowl¬ 
edge of the client votes across the cluster given by the 
current complete, serving summary, the node knows the 
list of all dynamic replicas currently active at any site. 
This list is sent to the client for its preferred sites. 
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Algorithm 2 Closest replica site inference 

Input: key. key of an object to read, S: preferred sites list 

function iNFERCLOSESTREPLICASrTE(kev) 

for each site 5 in 5 do 

let /?[s] be the dynamic replica summary for s 
if s is a static site for key or key in R[s] then 
return s 
end if 
end for 

return the closest static replica of val 

end function 


Luckily, the client request routing provided by Dy¬ 
namo enables us to only infer the location of the closest 
available replica without jeopardising the read protocol 
correctness. In case of inference error, the client will 
address its request to a wrong node, but this node will 
forward it to a static replica node which is guaranteed to 
hold the object. This enables us to use an approximate 
structure as a memory-efficient way to represent this list 
of dynamic replicas: Bloom filters [451. 


8.2. When to refresh dynamic replica summaries? 

Intuitively, clients need to periodically refresh dy¬ 
namic replica summaries to account dynamic replicas 
being created and deleted. We provide a simple yet ef¬ 
ficient way to let a client decide of the appropriate time 
to do so: error indication. 

Two different types of errors can be caused by our 
closest replica location algorithm: false positives - 
when a node is wrongly believed to hold a replica of the 
object, and false negatives - when the request is sent to 
a non-optimal replica of the object, i.e. when a closer 
replica existed according to the preferred sites of the 
client. A false positive happens if the replica summary 
is outdated and the replica was deleted since its last up¬ 
date, or in case of a false positive caused by the Bloom 
filter. A false negative occurs if the replica summary 
was last updated before this replica was created. 

Nodes in the server are able to indicate to clients such 
false positives (if the node being sent a read request does 
not hold the requested dynamic replica) or false neg¬ 
atives (based on the latest serving vote summary they 
hold) by flagging the response to a request accordingly. 
Clients keep a local counter of such inference errors and 
decide to refresh their dynamic replica summaries and 
reset this counter when it reaches a configured thresh¬ 


old. We discuss its configuration in Section 11.3 


To cope with cases where the configured summary 
validity period is too large for some usage patterns, the 


client keeps a count of false positives or false negatives. 
Should the client address the request to a server not 
holding the desired piece of data, the server relays the 
query as usual to the correct node but flags the answer as 
erroneous. This may happen for instance because it has 
been deleted after the client bloom filter has been last 
updated. We call this case a false positive. Inversely, if 
the client accesses a remote object replica and a closer 
one existed, which we call a false negative , the server 
responding to the client flags the response message as 
well. A counter of wrong assumptions is maintained 
by each client, and set to 0 every time the summary is 
updated. For each false positive or negative, the client 
increments its counter. It updates its bloom filter as soon 
as the counter reaches a configurable threshold. 

9. Prototype implementation 

We implement our approach atop a real-world key- 
value store. Voldemort 03 is an open-source clone 
of Dynamo ECO- It is developed by Linkedln and is 
extensively used in their software stack ESI . Volde¬ 
mort is a modular Java application, making it a relevant 
choice for our experiments. The total of our additions 
account for about 2,800 lines of Java code, excluding 
open-source Space-Saving and Bloom Filter libraries. 
We modify both the server code to handle decentralized 
object replication (Section |9.1[ ) and the client code to 
perform close replica location (Section |9.2[ i. 

9. 1. Server-side modifications 

Our server-side modifications are designed as a mid¬ 
dleware. We extensively use the native functionality 
provided by Voldemort to implement our prototype: 
Gossip and local persistent storage interface. Incom¬ 
ing client requests are first processed by this middle¬ 
ware to extract site preference information that is used 
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to update the local popularity measurement structure 
(Section [6. 2 \ . That structure is disseminated using the 
Voldemort gossip protocol (Section |63] i. A background 
thread is responsible for maintaining the cluster-wide 
popularity measurement structures using the informa¬ 
tion received from other nodes. This information is 
used to create or delete remote, dynamic object repli¬ 
cas. Such replicas are stored locally using Voldemort 
persistent storage. 

The most tricky and challenging part of the imple¬ 
mentation concerns write operations. We have to sub¬ 
stantially modify write handling so that a node holding 
a dynamic replica of a data object behaves as if it held 
a static replica of that object, forwarding the request to 
all other nodes holding replicas of that object according 
to its local state as detailed in Section 1731 


9.2. Client-side changes 


Modifications on the client were kept to the mini¬ 
mum. Each client maintains a statically-configured list 
of site preferences. Each storage operation request for 
any given object is routed according to our replica lo¬ 
cation algorithm outlined in Section 8.1 If no dy¬ 


namic replica for this object can be found in the local 
Bloom filter, the request is routed according to Volde¬ 
mort vanilla algorithm. Site preferences are piggy¬ 
backed to the request. Incoming responses are checked 
for false positive or false negative flags using a specif¬ 
ically designed middleware in order to update the local 
error counter described in Section Iik2l 


10. Experimental evaluation 

In this section we prove the effectiveness of our ap¬ 
proach using our prototype. We perform this in 4 steps: 

1. We first show that our object heat measurement 
technique presented in Section [6] is able to prop¬ 
erly identify the hottest objects in the cluster (Sec¬ 
tion [10T]|. 

2. We demonstrate that the object replication method 
described in Section IT] effectively replicates these 
hottest objects (Section [l0.2| i. 

3. We prove that our object location technique intro¬ 
duced in Section[8]is accurate (Section [l0.3| ). 

4. We finally confirm that the combination of these 
principles applied to Voldemort efficiently reduces 
the average read latency under different workloads 
when compared to other caching-based approaches 
(Section |10.4| ). 


Experimental platform. We deploy our prototype on 
96 nodes of the Amazon EC2 cloud. For all our 
experiments, we use t2.large general-purpose in¬ 
stances, evenly distributed over 6 sites spanning 
4 continents: California, Virginia, Ireland, Ger¬ 
many, Australia, and Japan. Each virtual ma¬ 
chine has access to 2 CPU cores and 8 GB RAM. 
The host server is outfitted with 10 Gigabit eth- 
ernet connectivity. Our measurements show the 
main bottleneck of these virtual machines to be 
the CPU. Single-site experiments on bare-metal 
servers show higher throughput than Amazon EC2. 

Dataset and workload. The main workloads we target 
are dominated by reads, such as the applications 
described in the motivation section above. Yet, we 
prove that even for write-intensive workloads, our 
data replication system increases average read per¬ 
formance at the cost of a slightly increased write 
latency due to the added cost for dynamic replica 
synchronization. To that extent, we use YCSB 1471 
to generate both our initial dataset and storage op¬ 
erations. YCSB is an industry-standard bench¬ 
mark commonly used to evaluate the performance 
of key-value stores. We generate 500 million 1 KB 
records, that we insert in Voldemort as our initial 
data. The workload against the cluster is gener¬ 
ated by two YCSB instances on each site (12 in 
total) running in their own virtual machine. Re¬ 
quests are generated according to a Zipfian dis¬ 
tribution, with Zipf parameter p = 0.99, drawing 
from a set of 20 million keys. Each measurement 
is performed 50 times at maximum throughput for 
a period of 5 minutes. Bar plots represent averages. 
To account for changing hot object set, YCSB in¬ 
stances are restarted in turn - one every 30 sec¬ 
onds. This leads to a varying request distribution 
over time. We collect all measurements on a sta¬ 
ble, non-saturated cluster to ensure that our mea¬ 
surements do not suffer from compute resource ex¬ 
haustion on the server. Requests are locally throt¬ 
tled at each node to a maximum of 1,000 concur¬ 
rently running queries. 

Experimental configuration. Our prototype uses the 
default time window length of 10 seconds, a grace 
period of 3, a k value of 10,000, and a client error 
threshold of 5%. Client-side site preferences are 
identical for every node in the same site. In the 
applications we target, both database clients and 
servers are located in a well-defined datacenter. 
Consequently, the preference list for each client 
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Table 1: Experimental client preferred site settings 


Site 

1 st preference 

2 nd preference 

3 rd preference 

California (us-west-i) 

California (O.s ms) 

Virginia (72.8 ms) 

- 

Virginia ( us-east-1> 

Virginia (o.s ms) 

California (72.8 ms) 

Ireland ( 87.2 ms) 

Ireland ( eu-east-i) 

Ireland ( o.7ms) 

Germany ( 27 .-/ ms) 

Virginia ( 87.2 ms) 

Germany (eu-central-1) 

Germany (O.&ms) 

Ireland ( 27 .-/ ms) 

- 

Australia (ap-southeast-2) 

Australia (0.9 ms) 

- 

- 

Japan (ap-northeast-l) 

Japan ( o.7ms) 

- 

- 
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Figure 3: Popular object identification performance. 


can be statically configured - showing no signifi¬ 
cant client overhead. Preferred sites are ordered by 
average measured round-trip latency, up to a max¬ 
imum of 100 ms. They are configured as indicated 
in Table]!] - each site being its own first preference. 
We keep Voldemort default configuration settings. 

Evaluated cache algorithms. We compare our ap¬ 
proach to cache-based algorithms: Least Recently 
Used (LRU), Most Recently Used (MRU), Ran¬ 
dom Replacement (RR), and Adaptive Replace¬ 
ment Cache (ARC) Ii48l . We implemented these 
algorithms in Voldemort, so that each clients con¬ 
tacts the local server that would hold the requested 
piece of data (as explained in Section [H] if cached. 

10.1. Popular objects identification 

We show in this section the behavior of our popular¬ 
ity measurement scheme introduced in Section [6] We 
run a read-only workload, as popular objects are only 
identified based on client read queries. We log for each 
YCSB instance the object keys requested. Collecting 
this information from every request generator enables 


us to calculate the exact set of k most frequent popular 
objects for each site across the cluster at any given time. 
We also log on each node the cluster-wide popularity 
summaries, at a one-second interval. Combining those 
logs with the aforementioned request logs enables us to 
derive a key metric: identification performance. Iden¬ 
tification performance corresponds to the proportion of 
identified popular objects at any given time compared 
to our exact, offline computation. A popular object is 
considered identified if it is present in the cluster-wide 
object heat summaries of all nodes of the cluster. This 
proportion is measured in percents, 100% indicating a 
perfect match between theoretical and practical results. 

We plot our results in Figure [3] Overall, they show 
that more than 95% of the hot couples are identified 
by our heat measurement and dissemination scheme. 
We can observe a regular identification performance 
drop of about 8%, repeated every 30 seconds. Each 
of these drops corresponds to one request generator be¬ 
ing restarted, leading to a sudden change of the popu¬ 
lar keys being requested by that generator. For each of 
these events, we note that the cluster convergence time 
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stays in the order of 12 seconds. This specific time is 
explained by our popularity measurement time window 
configured to 10 seconds, plus the Gossip protocol con¬ 
vergence time for the popularity measurements dissem¬ 
ination cluster-wide, which we measured to take about 
two seconds. 

We note clear variability in our measurements. This 
is caused by the rapid fluctuation of popularity compu¬ 
tations near the limit of the top-k estimator we use in 
our popular data identification scheme. We made sure 
of this by plotting the identification performance for the 
10% most popular objects. We observe that this blue 
curve is much more stable than the curve for all val¬ 
ues, meaning that the most popular objects requested by 
clients are identified accurately. 

We ran the same experiments with object sizes of 
100KB, 1 MB, 10 MB. The results is a replication time 
increase of respectively 1.2%, 3.8% and 6.2% compared 
to 1 KB size, due to the added network bandwidth re¬ 
quired for the data transfers. 

It is interesting to note that our measurements show 
that our hot object identification scheme never reaches 
100% accuracy. This is because of the unstable fre¬ 
quency estimations for keys near the detection limit of 
the Space-Saving structure, which changes significantly 
faster than the convergence time of the cluster. This also 
explains the noticeable variability of our measurements. 

10.2. Dynamic replica creation and deletion 

In this experiment we show the replication perfor¬ 
mance achieved by the principles we introduced in Sec¬ 
tion [7] As for popular object identification, only read 


operations are considered for replication: we run a read¬ 
only workload. We execute it for a duration of 5 minutes 
and calculate at each site the number of replicated ob¬ 
jects. We plot the results we obtain in Figure [4] with 
95th percentile confidence intervals. The red curve in¬ 
dicates the average number of active replicas per site 
at any given time. The blue curve indicates, among all 
created dynamic replicas, the average number of repli¬ 
cas which are correct, i.e. for which the object is present 
in the offline-computed, theoretical top-k for the corre¬ 
sponding site calculated as in the previous experiment. 

We first focus on the correct replica count curve, in 
blue. The regular drops are caused by YCSB clients be¬ 
ing restarted at regular intervals, which impact the dis¬ 
tribution of the key requests as previously observed. We 
note that this drop is only temporary, and that the cor¬ 
rectness of the replicas returns to normal quickly, after 
new replicas are being created, within approximately 12 
seconds as in the previous experiment. We show that our 
replication technique replicates a little more than 95% 
of the actual top-k most popular objects, which is coher¬ 
ent with the popular object identification performance 
we could observe in the previous experiment. 

We note on the red curve that the count of active dy¬ 
namic replicas is relatively stable at about 10,600 per 
site on a stabilized state. This is slightly more than the 
configured value of k (10,000) because of the grace pe¬ 
riod of 3, which causes replicas to be deleted with a 
slight delay when their popularity drops. We also no¬ 
tice that this curve stays at all times above the con¬ 
figured k, because the top-k items as identified by the 
nodes are always replicated. The number of active repli- 
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Figure 5: Client location error rates. 


cas increases slightly after each YCSB client restart be¬ 
cause of the grace period, which causes a number of 
new replicas being created without deleting immedi¬ 
ately the previously-popular objects. Once the grace pe¬ 
riod has elapsed, these dynamic replicas are removed, 
which causes the total number of replicas to decrease 
after these spikes. 

10.3. Replica location 

We show the accuracy of our replica location scheme 
CSection[8j using a read-only workload. We measure the 
afore-mentioned false positive and false negative rates 
at the clients. We plot the results in Figure[5] 

Let us focus first on the false negative curve, drawn in 
red. The spikes observed at regular intervals are caused 
by the restart of YCSB clients, which changes the distri¬ 
bution of the requested keys consequently causing new 
replicas to be created. Clients are only informed of these 
new replicas with error indication from the server nodes, 
explaining the rapid increase in false negative rate. As 
soon as this rate crosses the set threshold (5%), the 
clients update their local summary, causing the number 
of false positives to decrease rapidly. The first spike oc¬ 
curs at about 50 seconds, while the first client restart oc¬ 
curs at 30 seconds. This is because of the two 10-second 
rounds required for a summary to reach a serving state. 

The false positive curve, plotted in green, shows the 
same patterns. Spikes are caused by dynamic replicas 
being deleted from the cluster, after the end of the grace 
period. As with false negatives, error indication enables 
clients to detect these errors and update their summaries 
within less than 10 seconds. 


10.4. Evaluating the whole strategy: latency impact 

In this section we evaluate the latency impact of 
our solution. To this end, we use two YCSB work¬ 
load patterns to evaluate our scheme: read-heavy (95% 
reads - 5% writes) and update-heavy (50% reads - 50% 
writes). Due to the lack of dynamic replica place¬ 
ment systems to compare with, we choose to implement 
cache-based algorithms at all sites: Least Recently Used 
(LRU), Most Recently Used (MRU), and Adaptive Re¬ 
placement Cache (ARC) (48)- Such edge-caching is 
one commonly-used solution to improve read latency 
of globally-distributed storage systems, as, for example, 
at Facebook ED- We keep unmodified Voldemort as a 
baseline. We implemented these algorithms in Volde¬ 
mort, so that each clients contacts the local server that 
would hold the requested piece of data (as detailed in 
Section |7T) if cached. We do not implement writes for 
these caching algorithms, as they are not designed for 
that purpose. Using the original Voldemort algorithm 
we simulate writes to different keys than the ones being 
read. 

We plot in Figure [6] the measured read latency of 
all five systems under a sustained read-heavy workload 
over time, with 95th percentile confidence intervals. It 
can be observed that the read latency achieved without 
caching or with MRU caching is stable, and is not af¬ 
fected by the changes in the distribution of requested 
keys caused by YCSB client restarts. We note that MRU 
causes higher latency than no caching at all. This is ex¬ 
plained by our workload, which is the worst possible 
case for that algorithm: the most recent keys are the 
more likely to be requested. Yet, such caching prevents 
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Figure 6: Latency over time with a 95% read / 5% write workload. 


single-hop reads. Our dynamic replication scheme per¬ 
forms globally better than both ARC and LRU, showing 
lower latency spikes at YCSB client restarts. This is be¬ 
cause we enable locating dynamic replicas at close sites 
while caching only allows site-local caching. The sim¬ 
ilarity between LRU and ARC curves is related to the 
fact that the latter is built atop the former. 

Figure [7] shows the impact of the workload on the la¬ 
tency of respectively reads and writes of all systems. 
We implemented writes on ARC, LRU and MRU using 
asynchronous cache eviction technique as implemented 
at Facebook ED- As a write is received on any given 
key, an asynchronous event is triggered and broadcasted 
to every datacenter in order to evict this key, if cached, 
from the local cache. 


On Figure 7a we show that read latencies tend to 
increase as the ratio of reads over writes decreases, 
but keep the same relative order. We observe that the 
read performance advantage of our method, previously 
observed with a 95% / 5% read-to-write ratio is pre¬ 
served even when this ratio decreases. We also note 
that unmodified Voldemort and our dynamic replication 
method show only marginally increasing latencies as the 
read-to-write ratio decreases. This is because of the rel¬ 
atively low overhead of the write protocol, which only 
broadcasts the write notifications to datacenters effec¬ 
tively holding a replica of the keys being written to. 


Figures [Tbjdetails this last point, showing the write la¬ 
tency of our dynamic replication method being slightly 
higher than the one of unmodified Voldemort. This is 
due to the additional bookkeeping required to keep dy¬ 
namic data replicas synchronized with static replicas. 


The latency showed by other caching methods is compa¬ 
rable, as the write protocol is the same for each method. 
However, the overhead of broadcasting write events to 
every datacenter significantly hinders the write latency. 

This demonstrates that although we targeted our geo¬ 
replication method to read-intensive workloads with 
rare writes, the read latency is also significantly de¬ 
creased compared to unmodified Voldemort and edge 
caching-based approaches for balanced and even write¬ 
intensive workloads, at the cost of a sensible write la¬ 
tency increase. 


11. Discussion 


11.1. Bounding the number of replicated objects: 
choosing k 

The number of replicated objects at a site s is strongly 
tied to the configuration of k for that site, that we name 
k s . This is because all topA v voted-for objects are repli¬ 
cated dynamically at s. We deduce that, at any time, 
there are at least k s elements replicated on site s. Dy¬ 
namic replicas are deleted when their popularity drops 
under the top-fcj threshold. Assuming that A such that 
0 < A < 1 is the average replacement ratio of new ele¬ 
ments in the top-f- of any round compared to the previ¬ 
ous one, the number R s of dynamic replicas available at 
site s is: 

Rs = k s + {k s * Tg) (2) 


We extract from this equation the value of k s for an 
available dynamic replication capacity C s at site s such 
that R s < C s : 


k s < 


C s 

Ag + 1 


(3) 
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(b) Workload write latency impact 
Figure 7: Latency impact of the workload. 
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11.2. Configuration of the round length 

The voting round length is an essential parameter for 
our replication scheme. It obviously impacts the object 
detection latency between the moment when an object 
is becoming popular at the clients and the time when 
it is identified as such by the cluster and replicated. 
The lower the round length, the faster the identifica¬ 
tion. However, two things are crucial to consider when 
choosing its value. First, a low value will cause vote 
summaries to be disseminated at more regular intervals 
between nodes, increasing the network overhead. But 
more importantly, the vote round length should be set 
to be much higher than the Gossip convergence time of 
the cluster. Failing to consider this parameter results in 
more frequent incomplete summaries at the nodes be¬ 
cause of delayed network packets, and consequently in 
poor replication performance. 

11.3. Keeping client errors low: setting the error 
threshold 

The error threshold is set up as a ratio of errors to the 
number of requests processed. It has to be configured 
relatively to the performance requirements of the appli¬ 
cation accessing data. The lowest the expected latency, 
the lowest this ratio. In our experiments on replica lo¬ 
cation (Section [10.3| >, we show that a set value of 5% 
allows nodes to detect quickly new replicas while not 
updating replica summaries too often. Good balance be¬ 
tween these two parameters is essential: setting an error 
threshold too low may be counter-productive because 
of the additional network overhead and computational 
power required by summary refreshes. 

11.4. Memory complexity 

Hot object identification algorithm The memory 
complexity of Space-Saving structures is con¬ 
figurable: 0(k). We deduce that the memory 
complexity of our popular object identification 
algorithm is 0(k *N *S). Evaluation on a 96-node 
cluster gives good results. We plan to further 
explore its worst case scenario: thousands of 
nodes, or more. 

Location algorithm Bloom biters are a memory- 
efficient way to test element membership in a list. 
It has a configurable error, impacting its accuracy. 
We choose an error probability of 0.001. With 
k = 10,000 , the size of a dynamic replica summary 
for a single site is under 18 KB. 


11.5. Bounce period efficiency 

In our experiments with a 95% / 5% read/write ra¬ 
tio, we observed that disabling the bouncing period (set¬ 
ting it to 1) increased network traffic by an average of 
5%, because of some objects showing a popularity near 
the threshold to be unnecessarily, repeatedly created and 
deleted ; further increasing the bouncing period leads to 
no substantial network traffic reduction. 

12. Conclusion 

In this paper we present a decentralized dynamic 
replication strategy that integrates with the existing ar¬ 
chitecture of existing state-of-the-art storage systems, 
which enables us to leverage their existing, built-in al¬ 
gorithms to efficiently handle geo-distributed read and 
writes. We allow the clients to efficiently locate a close 
dynamic replica of the data without prior communica¬ 
tion with a dedicated metadata server. We deploy a pro¬ 
totype implementation of these methods on 96 nodes 
distributed over 6 locations of the Amazon EC2 cloud. 
Experiments show a latency improvement of up to 42% 
compared to other state-of-the-art, caching-based solu¬ 
tions. 

We are working on further enhancing our ap¬ 
proach. This includes designing a write-aware replica¬ 
tion scheme better able to handle balanced read/write 
workloads, improving the horizontal scalability of the 
popular object identification scheme, and generalizing 
it to a variety of different storage systems. Finally, we 
want to explore the behavior of our proposal with a 
wider range of parameters and workloads, and formal¬ 
ize the configuration trade-offs. 
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